release: RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, coverage by jb-thery · Pull Request #47 · jcode-works/jcode-ragmir

jb-thery · 2026-07-03T17:38:23Z

Release PR — develop → main

Promotes the RAG retrieval, security, and coverage overhaul to production. Merging this PR triggers the protected Release npm workflow, which runs semantic-release to derive the version from Conventional Commits and publishes @jcode.labs/ragmir-tts then @jcode.labs/ragmir to npm.

Expected version bump

MINOR (e.g. 2.0.0 → 2.1.0), driven by the feat commits (RRF fusion, IVF_PQ index, config hardening). No breaking public-API changes.
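
As a rough illustration of why this set of commits is expected to yield a MINOR bump, the sketch below maps Conventional Commit headers to a release type. It is a simplified approximation of common semantic-release defaults, not the project's actual release configuration; `bumpForCommit`, `bumpForRelease`, and `applyBump` are hypothetical names introduced here.

```typescript
// Simplified sketch: feat -> minor, fix/perf -> patch, breaking -> major.
// Assumption: the real workflow may use different or additional release rules.
type Bump = "major" | "minor" | "patch" | null;

const ORDER = { patch: 1, minor: 2, major: 3 } as const;

function bumpForCommit(message: string): Bump {
  const header = message.split("\n", 1)[0] ?? "";
  // type, optional (scope), optional "!" breaking marker, then ": "
  const match = /^(\w+)(?:\([^)]*\))?(!)?:\s/.exec(header);
  if (/^BREAKING CHANGE:/m.test(message) || match?.[2] === "!") return "major";
  if (!match) return null;
  if (match[1] === "feat") return "minor";
  if (match[1] === "fix" || match[1] === "perf") return "patch";
  return null; // test, chore, docs, ... do not trigger a release on their own
}

function bumpForRelease(messages: string[]): Bump {
  let best: Bump = null;
  for (const message of messages) {
    const bump = bumpForCommit(message);
    const rank = bump === null ? 0 : ORDER[bump];
    const bestRank = best === null ? 0 : ORDER[best];
    if (rank > bestRank) best = bump;
  }
  return best;
}

function applyBump(version: string, bump: Bump): string {
  const [major = 0, minor = 0, patch = 0] = version.split(".").map(Number);
  if (bump === "major") return `${major + 1}.0.0`;
  if (bump === "minor") return `${major}.${minor + 1}.0`;
  if (bump === "patch") return `${major}.${minor}.${patch + 1}`;
  return version;
}

// Commit headers abbreviated from this PR's change list.
const releaseBump = bumpForRelease([
  "feat(query): weighted Reciprocal Rank Fusion for hybrid retrieval",
  "feat(store): automatic IVF_PQ vector index above 256 rows",
  "fix(redaction): Luhn verification on credit cards",
  "test: suite 132 -> 151 cases / 23 files",
  "chore: dist/ is now gitignored build output",
]);
const nextVersion = applyBump("2.0.0", releaseBump);
```

The strongest commit type wins, so a single feat commit is enough to lift the release from PATCH to MINOR.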

What's in this release

feat(query): weighted Reciprocal Rank Fusion for hybrid retrieval (rank-only, no score calibration). Recall 1.0 on golden set.
feat(store): automatic IVF_PQ vector index above 256 rows (scalability beyond brute force).
fix(redaction): Luhn verification on credit cards, URL username redaction, Stripe/GitLab/Bearer providers.
feat(core): strict config schema, env-override warnings, access-log retention (10 MB cap), bounded LRU Transformers cache, CLI parsers extracted to testable module.
test: suite 132 → 151 cases / 23 files.
chore: dist/ is now gitignored build output.

Pre-merge verification

PR feat(core): RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, and test coverage #46 → develop merged green (Quality gate, Analyze TypeScript, Commitlint, CodeQL all SUCCESS).
pnpm validate equivalent run locally (lint + audit + check + test 151/151 + build + smoke).
Confidentiality posture verified on real index: zeroTelemetry=true, llmGeneration=false, redactionEnabled=true.

After merge

The Release npm workflow publishes both packages. No local publish, no direct push to main.

chore: back-merge main into develop after 2.0.0 release

Move all packages/*/dist/ directories from committed artifacts to gitignored build output. dist/ is regenerated locally with `pnpm build` before running the CLI, MCP smoke, the library-API demo, or `pnpm validate`.

- .gitignore: ignore ragmir-core/dist, ragmir-tts/dist (already ignored for app/landing/license-webhook); add *dist catch-all.
- ci.yml: drop the `git diff --exit-code -- dist` step that enforced committed dist, since dist is no longer tracked.
- AGENTS.md, CLAUDE.md, README.md, library-api-demo README: document that dist is gitignored and must be built locally; warn against `npx ragmir` for local testing (resolves the published npm package, not the working copy).

Replace the weighted-sum fusion (vector and BM25 scores divided by their max) with Reciprocal Rank Fusion, the standard hybrid-retrieval approach. Each candidate scores `weight / (RRF_K + rank)` per retriever it appears in, summed across retrievers, so the BM25 and vector score distributions never need calibration against each other. The vector retriever is weighted higher (0.7) than the lexical one (0.3) because, with the default local-hash embeddings, vector proximity is the more discriminant signal on small corpora; the lexical weight still lets exact-keyword evidence pull in candidates the vector retriever missed.

- RRF_K = 60 (Cormack et al. 2009 constant).
- Remove the now-unused weighted-sum helpers (vectorScore, normalizeScore) and the normalizeForMatch import left dead by the refactor.

Retrieval recall stays at 1.0 on the sovereign-rag-demo golden set.
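
The fusion rule described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `RankedList` and `fuseRrf` are hypothetical names, and the use of 1-based ranks is an assumption, since the commit message does not say whether ranks start at 0 or 1.

```typescript
// Weighted Reciprocal Rank Fusion over best-first ranked id lists.
// Assumption: ranks are 1-based; the real implementation may differ.
type RankedList = { weight: number; ids: string[] };

const RRF_K = 60; // constant quoted in the commit message

function fuseRrf(lists: RankedList[]): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const { weight, ids } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1;
      // Only the rank is used, never the retriever's raw score.
      scores.set(id, (scores.get(id) ?? 0) + weight / (RRF_K + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Vector retriever weighted 0.7, lexical (BM25) retriever weighted 0.3.
const fused = fuseRrf([
  { weight: 0.7, ids: ["a", "b", "c"] },
  { weight: 0.3, ids: ["d", "a"] },
]);
const fusedOrder = fused.map((entry) => entry.id).join(",");
```

In this toy input, "a" appears in both lists and therefore accumulates two contributions, while "d" is found only by the lexical retriever and still enters the fused list, which is the behaviour the lexical weight is meant to preserve.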

Above a 256-row threshold, automatically create an IVF_PQ index on the vector column after writing the table. Below the threshold, LanceDB keeps using an exact flat scan, which is optimal for small corpora and avoids wasted index-training work.

- numPartitions ≈ sqrt(rowCount), clamped to [8, 1024] (LanceDB production heuristic).
- numSubVectors = 16 (divides the 384-dim local-hash/mxbai-xsmall vectors).
- index creation is idempotent (skipped if vector_idx exists) and best-effort (a training failure on edge-case dimensionality leaves the table usable via flat scan rather than failing the ingest).

This unblocks query scalability beyond brute-force scan without changing the overwrite write path.
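
The parameter heuristic above could be expressed as a small pure function. This is a sketch under stated assumptions: `planVectorIndex` and `IndexPlan` are hypothetical names, the threshold is treated as strictly greater than 256 rows, the square root is rounded to the nearest integer, and the divisibility guard is an illustrative stand-in for the best-effort behaviour. No LanceDB call is shown.

```typescript
// Decide whether to build an IVF_PQ index and with which parameters.
// Assumption: returning null means "keep the exact flat scan".
type IndexPlan = { numPartitions: number; numSubVectors: number };

const INDEX_ROW_THRESHOLD = 256;
const NUM_SUB_VECTORS = 16;

function planVectorIndex(rowCount: number, dimension: number): IndexPlan | null {
  // Small tables stay on the flat scan; training an index is wasted work.
  if (rowCount <= INDEX_ROW_THRESHOLD) return null;
  // Product quantization splits each vector into equal sub-vectors,
  // so the dimension must be divisible by the sub-vector count.
  if (dimension % NUM_SUB_VECTORS !== 0) return null;
  const approx = Math.round(Math.sqrt(rowCount));
  const numPartitions = Math.min(1024, Math.max(8, approx));
  return { numPartitions, numSubVectors: NUM_SUB_VECTORS };
}

const smallPlan = planVectorIndex(200, 384);
const mediumPlan = planVectorIndex(10_000, 384);
const hugePlan = planVectorIndex(4_000_000, 384);
```

With 384-dimensional vectors, 16 sub-vectors give 24 dimensions per sub-vector. The upper clamp only takes effect once the table passes roughly one million rows.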

Close two confidentiality gaps and broaden provider coverage in the built-in redaction patterns:

- credit_card: add a match-then-verify Luhn check (new RedactionPattern.verify field). Numeric runs that are not valid card numbers (version numbers, account IDs, hex runs) are left untouched instead of being over-redacted.
- url_credentials: extend the pattern so both the username and the password are redacted. Previously only the password was stripped, leaking the username.
- Add Stripe secret keys (sk_live/rk_live/sk_test), GitLab tokens (glpat-), and generic Bearer tokens. Order the more specific patterns before the generic api_token so they win on overlap.
- Add an optional `verify: "luhn"` to the RedactionPattern type so custom patterns can opt into the same check.
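
The match-then-verify idea for credit cards can be illustrated with a short sketch. The regular expression, the `[REDACTED:credit_card]` placeholder, and the function names `passesLuhn` and `redactCards` are assumptions made for this example; the actual pattern and replacement text in the package may differ.

```typescript
// Luhn checksum: double every second digit from the right, subtract 9
// from results above 9, and require the total to be divisible by 10.
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, "");
  if (!/^\d{13,19}$/.test(digits)) return false;
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let digit = digits.charCodeAt(i) - 48; // "0" is char code 48
    if (double) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
    double = !double;
  }
  return sum % 10 === 0;
}

// Match 13 to 19 digits with optional single space or hyphen separators,
// then redact only the matches that pass the checksum.
function redactCards(text: string): string {
  return text.replace(/\b\d(?:[ -]?\d){12,18}\b/g, (match) =>
    passesLuhn(match) ? "[REDACTED:credit_card]" : match,
  );
}

const redacted = redactCards("card 4111 1111 1111 1111 ok");
const untouched = redactCards("id 4111111111111112");
```

The second input differs from a well-known test card number only in its last digit, so it fails the checksum and is left as-is, which is the over-redaction case the commit message describes.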

…d use

Several additive robustness and observability improvements, plus extraction of the CLI option parsers into a testable module:

- config: make rawConfigSchema strict so unknown keys (typos) are rejected instead of silently ignored; warn on stderr when an env override (e.g. RAGMIR_TOP_K=abc) is invalid so operators notice a no-op override.
- access-log: bound the log growth with a soft cap. When the file exceeds 10 MB, trim it to the most recent 50 000 lines before the next append, so a long-lived MCP server cannot grow it without limit or OOM a usage report.
- embeddings: bound the Transformers.js pipeline cache to 3 entries with LRU eviction, and export clearTransformersCache(). destroyIndex now calls it so a re-ingest with a different embedding config does not pin stale ONNX weights.
- cli-options: extract the pure option parsers (parsePositiveInt, parseNumber, parseRecallThreshold, audioEngine, audioAllowRemoteModels, audioLanguage, parseAgentInstallScope, parseAgentInstallMode) into a dedicated module so they can be unit-tested without importing commander. cli.ts imports them. parsePositiveInt now rejects fractional input like "1.5" instead of silently truncating via parseInt.
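
Two of the changes above lend themselves to a short illustration: the strict positive-integer parser and the bounded LRU cache. The sketch below is not the code from the PR. The signature of `parsePositiveInt` is assumed (the real parser may accept additional arguments), and `BoundedLruCache` is a hypothetical class standing in for whatever structure holds the Transformers.js pipelines.

```typescript
// Strict parser: accepts only plain decimal digits, so "1.5" and "abc"
// are rejected rather than truncated the way parseInt would handle them.
function parsePositiveInt(raw: string): number {
  const trimmed = raw.trim();
  const value = Number(trimmed);
  if (!/^[0-9]+$/.test(trimmed) || !Number.isSafeInteger(value) || value <= 0) {
    throw new Error(`Expected a positive integer, got "${raw}"`);
  }
  return value;
}

// Bounded LRU cache built on Map insertion order: the first key in the
// map is always the least recently used one.
class BoundedLruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  has(key: K): boolean {
    return this.entries.has(key);
  }

  clear(): void {
    this.entries.clear();
  }
}

// Three entries, matching the bound quoted in the commit message.
const pipelines = new BoundedLruCache<string, string>(3);
pipelines.set("model-a", "pipeline A");
pipelines.set("model-b", "pipeline B");
pipelines.set("model-c", "pipeline C");
pipelines.get("model-a"); // touch: model-b becomes the eviction candidate
pipelines.set("model-d", "pipeline D");
```

After the last call the cache holds model-a, model-c, and model-d; model-b was evicted because it was the least recently used entry. A `clear()` method plays the role that clearTransformersCache() is described as playing when an index is destroyed.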

Close the test-coverage gaps the audit identified, raising the suite from 132 to 151 cases across 23 files:

- destroy.test.ts (new): destroyIndex removed flag and access-log entry.
- query.test.ts: ask() empty-sources and populated cited-retrieval branches.
- store.test.ts: empty-text-files manifest round-trip, removal on empty, missing, malformed, and malformed-entry filtering; writeRows zero-rows dropTable and full re-write.
- embeddings.test.ts: embedTexts([]) early return and clearTransformersCache.
- ingest.test.ts: --rebuild forces a full re-index (reusedFiles === 0).
- config.test.ts: strict() rejects unknown keys; non-object config rejected.
- access-log.test.ts: retention trims past 10 MB; disabled logging writes nothing.
- evaluate.test.ts: miss case (hit=false, bestRank=null, recall=0).
- redaction.test.ts: Luhn pass/fail, URL username redacted, Stripe/GitLab/bearer providers, obfuscation limitation documented.
- cli.test.ts (new): all cli-options parsers incl. the MP3-without-engine confidentiality guard and agent scope/mode validation.
- text.test.ts (new): tokenize/normalizeForMatch (the BM25 foundation).

…y-overhaul feat(core): RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, and test coverage

github-actions · 2026-07-03T17:51:22Z

🎉 This PR is included in version 2.1.0 🎉

The release is available on:

v2.1.0
GitHub release

Your semantic-release bot 📦🚀

jb-thery added 8 commits July 3, 2026 17:47

Merge pull request #45 from jcode-works/chore/backmerge-main-after-2.0.0

21a6fe3

chore: back-merge main into develop after 2.0.0 release

Merge pull request #46 from jcode-works/feature/rag-retrieval-securit…

c6253d2

…y-overhaul feat(core): RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, and test coverage

jb-thery merged commit 17e20a1 into main Jul 3, 2026
10 checks passed

github-actions Bot added the released label Jul 3, 2026

jb-thery merged 8 commits into main from develop
